Abstract…

Introduction

General introduction to the raw dataset:

Data on squirrels in Central Park, New York, were collected over a 14-day period from October 6 to October 20, 2018. The records include characteristics of each squirrel, such as age and fur color, along with observations such as the sounds it made and where it was seen.

Motivation and initial question:

Squirrels are found everywhere, and some places clearly have more of them than others. Is there any pattern in where they stay with respect to their color, age, activities, or other features? An analysis of the squirrel census data may answer this question.

Main final goal:

The main goals of the project are to build maps from the census and to build functions that predict the location of a squirrel from its characteristics, so that people can use the website to look for the kinds of squirrels they like.

Inspirations:

The map-making process shown in class caught our eye because it is a clear and direct way to convey information. A website called The Squirrel Census (https://www.thesquirrelcensus.com/about) has also studied the squirrels, but it does not provide any predictions, so we aim to develop a prediction system. To attract more people to the website, we will also make interactive plots that introduce the raw dataset. The census covers only Central Park in 2018, so we will collect other datasets, such as Central Park data from other years or 2018 data from other places, to compare changes in location.

Method

Source:

The raw data we analyzed come from NYC Open Data. The dataset includes 3,023 observations in total, each recording some of the squirrel's characteristics and the corresponding location.

Preliminary work:

The raw dataset covers only Central Park in 2018, so we found two other datasets as supporting material for comparison with the raw dataset. One covers squirrels not only in Central Park but across New York City. The other describes the characteristics and behaviors of different animals, squirrels included.

Data cleaning:

For data tidying and cleaning, the categorical variables were converted to numeric codes for analysis and model building. We did not discard missing values; instead we coded them as 0, because there are many of them and omitting them might weaken the validity of the predictions. The dates were also cleaned. The same cleaning steps were applied to the data behind the comparison plots.
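The encoding step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names come from the report, but the category labels and the integer codes in `CODEBOOK` are assumptions made for the example.

```python
# Hypothetical category-to-code tables; the real codes used in the
# project are not given in the report.
CODEBOOK = {
    "shift": {"AM": 1, "PM": 2},
    "age": {"Juvenile": 1, "Adult": 2},
    "primary_fur_color": {"Gray": 1, "Cinnamon": 2, "Black": 3},
}


def encode_row(row, codebook=CODEBOOK):
    """Return a copy of `row` with categorical fields replaced by integer codes.

    Missing or unrecognised values are coded as 0 instead of being dropped.
    """
    encoded = dict(row)
    for column, codes in codebook.items():
        encoded[column] = codes.get(row.get(column), 0)
    return encoded


# Three made-up observations, two of them with a missing value.
rows = [
    {"shift": "AM", "age": "Adult", "primary_fur_color": "Gray"},
    {"shift": "PM", "age": None, "primary_fur_color": "Cinnamon"},
    {"shift": "PM", "age": "Juvenile", "primary_fur_color": ""},
]
encoded_rows = [encode_row(r) for r in rows]
```

Reserving 0 for missing values keeps every observation in the data, at the cost of treating "missing" as its own level on the same numeric scale as the real categories.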

Model building process:

Since the output has two components, longitude and latitude, we planned to fit two linear functions, one with longitude and one with latitude as the outcome, each regressed on the predictors, and to combine the two predictions at the end. We built several models for each outcome using different selection methods: p-value, stepwise (backward and forward at the same time), criterion-based, and LASSO. The explanations below are for the longitude model only; the latitude model follows exactly the same procedure.
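The idea of fitting one linear model per coordinate and then combining the two predictions might look like the sketch below. It assumes NumPy is available and uses synthetic stand-in data with three coded predictors; the function names `fit_linear` and `predict_location` are ours, not part of the project.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_location(coef_long, coef_lat, features):
    """Combine the two fitted models into one (longitude, latitude) pair."""
    x = np.concatenate([[1.0], features])
    return float(x @ coef_long), float(x @ coef_lat)


# Synthetic stand-in data (not the census): 50 squirrels, 3 coded predictors.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 3)).astype(float)
longitude = -73.97 + 0.001 * X[:, 0] + rng.normal(0, 1e-4, 50)
latitude = 40.78 + 0.002 * X[:, 1] + rng.normal(0, 1e-4, 50)

coef_long = fit_linear(X, longitude)
coef_lat = fit_linear(X, latitude)
lon, lat = predict_location(coef_long, coef_lat, np.array([1.0, 2.0, 0.0]))
```

Fitting the two coordinates separately treats them as independent outcomes, which keeps each model simple but ignores any correlation between longitude and latitude.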

The first step was to put all the numerical variables into the model and check their p-values. The variables are shift + age + primary_fur_color + location + activity + reaction + sounds. Although hectare is also a numerical variable, it is not included, because users of the model would not know how many squirrels there are within a specific hectare; they only know the characteristics of the squirrels they want to look for. Variables with a p-value greater than 0.05 were removed, and the model built with the remaining variables was checked again to make sure all of them had p-values less than 0.05. This produced the first model candidate, with predictors ‘shift’, ‘age’, ‘activity’, ‘reaction’, and ‘sounds’.
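A rough sketch of this p-value screening is given below, assuming NumPy. Two simplifications are ours: the p-values use a normal approximation to the t distribution (reasonable only for large samples), and the data are synthetic, with the column labelled "location" generated as pure noise purely for illustration.

```python
import math

import numpy as np


def slope_p_values(X, y):
    """Two-sided p-values for each slope (normal approximation to the t test)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - design.shape[1])
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    z = coef / std_err
    # erfc(|z| / sqrt(2)) is the two-sided tail probability of a standard normal.
    return [math.erfc(abs(v) / math.sqrt(2)) for v in z[1:]]


def select_by_p_value(X, y, names, alpha=0.05):
    """Drop predictors with p >= alpha, refit, and repeat until all pass."""
    keep = list(range(X.shape[1]))
    while keep:
        p_values = slope_p_values(X[:, keep], y)
        survivors = [col for col, p in zip(keep, p_values) if p < alpha]
        if len(survivors) == len(keep):
            break
        keep = survivors
    return [names[i] for i in keep]


# Synthetic data: the outcome depends on columns 0 and 2 only.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3))
y = 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(0, 1.0, 2000)
names = ["shift", "location", "sounds"]
selected = select_by_p_value(X, y, names)
```

Refitting after each round matters because removing one variable changes the p-values of the others.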

Then we selected a model using an automatic procedure, specifically stepwise regression. Backward, forward, and stepwise methods might produce different results, but we chose stepwise because it gives a single ‘best’ model. As a result, all 6 variables other than location are included in this model, which is the second model candidate.
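For readers unfamiliar with bidirectional stepwise selection, here is a minimal sketch assuming NumPy. The report does not say which criterion drove the search, so the use of AIC here is an assumption, as are the helper names and the synthetic data.

```python
import math

import numpy as np


def aic_of(X, y, cols):
    """Gaussian AIC (up to a constant) of an OLS fit on the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * math.log(rss / n) + 2 * design.shape[1]


def stepwise_both(X, y):
    """Start from the full model; at each step try dropping or adding one column."""
    selected = list(range(X.shape[1]))
    current = aic_of(X, y, selected)
    while True:
        moves = []
        for c in range(X.shape[1]):
            if c in selected:
                trial = [s for s in selected if s != c]  # backward move
            else:
                trial = selected + [c]  # forward move
            moves.append((aic_of(X, y, trial), trial))
        best_aic, best_cols = min(moves, key=lambda m: m[0])
        if best_aic >= current:
            return sorted(selected)  # no single move improves the AIC
        selected, current = best_cols, best_aic


# Synthetic data: the outcome depends on columns 0 and 2 only.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = 1.0 * X[:, 0] - 0.7 * X[:, 2] + rng.normal(0, 1.0, 1000)
chosen = stepwise_both(X, y)
```

The search stops at the first model that no single addition or removal can improve, which is why it returns one model, though not necessarily the globally best subset.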

Next, we used a criterion-based procedure. The model with the largest adjusted R-squared value and the smallest AIC and BIC values was chosen as the model candidate. It turned out to contain the same 6 variables as the model from the automatic procedure.
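The three criteria can be computed from each model's residual sum of squares, as in the sketch below. The AIC and BIC formulas are the usual Gaussian forms up to an additive constant, so only differences between models are meaningful. The sample size 3,023 is taken from the report; the sums of squares are hypothetical numbers chosen for illustration.

```python
import math


def adjusted_r2(rss, tss, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


def aic(rss, n, p):
    """Gaussian AIC up to an additive constant; p + 1 counts the intercept."""
    return n * math.log(rss / n) + 2 * (p + 1)


def bic(rss, n, p):
    """Gaussian BIC up to an additive constant; penalty grows with log(n)."""
    return n * math.log(rss / n) + math.log(n) * (p + 1)


N_OBS = 3023  # number of observations reported for the census data
TSS = 100.0   # hypothetical total sum of squares

# Hypothetical (number of predictors, residual sum of squares) per candidate.
candidates = {
    "five predictors": (5, 80.0),
    "six predictors": (6, 79.0),
    "seven predictors": (7, 78.99),
}

best_by_adj_r2 = max(
    candidates, key=lambda k: adjusted_r2(candidates[k][1], TSS, N_OBS, candidates[k][0])
)
best_by_aic = min(candidates, key=lambda k: aic(candidates[k][1], N_OBS, candidates[k][0]))
best_by_bic = min(candidates, key=lambda k: bic(candidates[k][1], N_OBS, candidates[k][0]))
```

With these made-up numbers all three criteria agree, because the seventh predictor reduces the residual sum of squares too little to pay for its penalty.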

The LASSO model selection method was then used. After searching for the best lambda value, the third model candidate kept all seven predictors, which means the selection procedure removed no variables.
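To show how the lambda value controls which predictors survive, here is a small coordinate-descent LASSO with a hold-out search over a lambda grid, assuming NumPy. It is a sketch under our own assumptions (synthetic data, a four-value grid, a single train/validation split), not the procedure actually used in the project, which may have relied on cross-validation in a statistics package.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def fit_lasso(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1 by coordinate descent.

    Assumes no column of X is constant.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    col_scale = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back.
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta


# Synthetic data: the outcome depends on columns 0 and 1 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 500)
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]


def validation_mse(lam):
    intercept, beta = fit_lasso(X_train, y_train, lam)
    return float(np.mean((y_val - intercept - X_val @ beta) ** 2))


grid = [10.0, 1.0, 0.1, 0.01]
best_lam = min(grid, key=validation_mse)
_, beta_best = fit_lasso(X_train, y_train, best_lam)
_, beta_heavy = fit_lasso(X_train, y_train, 10.0)
```

A large lambda shrinks every coefficient to exactly zero, while a small one leaves most of them in the model. A best lambda that keeps every predictor simply means the penalty that predicts best is too weak to remove any of them.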

We now have three different candidates for the final ‘best’ model, and they are nested within one another. We chose the ‘best’ model according to two criteria: adjusted R-squared and RMSE. For the longitude model, the final model has 6 predictors (shift + age + primary_fur_color + activity + reaction + sounds), since it has the highest adjusted R-squared value and an RMSE distribution very similar to those of the other models.

As for the latitude model, the final model has 5 predictors (sounds + primary_fur_color + reaction + activity + shift), while all the other candidates have 6 predictors. Since the RMSE and adjusted R-squared values are approximately the same across all models, the principle of parsimony tells us to choose the most succinct model.
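The parsimony rule used here can be written down explicitly, as in the sketch below. The RMSE values, the 1% tolerance, and the sixth predictor in the larger candidate are hypothetical; only the 5-predictor list comes from the report.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def most_parsimonious(candidates, tolerance=0.01):
    """Among models whose RMSE is within `tolerance` (relative) of the best,
    return the one with the fewest predictors."""
    best = min(c["rmse"] for c in candidates)
    close = [c for c in candidates if c["rmse"] <= best * (1 + tolerance)]
    return min(close, key=lambda c: len(c["predictors"]))


# Hypothetical RMSE values for two latitude candidates.
candidates = [
    {
        "predictors": ["sounds", "primary_fur_color", "reaction", "activity", "shift"],
        "rmse": 0.01010,
    },
    {
        "predictors": ["sounds", "primary_fur_color", "reaction", "activity", "shift", "age"],
        "rmse": 0.01008,
    },
]
chosen = most_parsimonious(candidates)
```

With these numbers the 6-predictor model has a marginally lower RMSE, but the difference falls inside the tolerance, so the 5-predictor model is returned.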

Results

Data summaries:

The first graph we drew was ‘Number of Observations’ vs. ‘Time of Day’, with morning and afternoon observations separated. It shows that squirrels tend to be more active in the afternoon or evening. A limitation of the data is that we do not have the exact time of each activity, only whether it was recorded in the morning or the evening. We assume the squirrels were seen before sunset, since they should be busy collecting food while there is sunlight.

The second graph we drew was ‘Number of Observations’ vs. ‘Primary Fur Color’. It shows that the number of observations differs from day to day with no clear pattern. Squirrels were most active on Oct. 7 and Oct. 13, and they clearly became less active in the last few days. Overall, gray squirrels are the most numerous and black ones the fewest. Cinnamon was also observed fairly frequently, along with some squirrels whose color was not identified.

The third graph we drew was a pie chart showing the distribution of squirrels by age. The majority (89%) of the population is juvenile, while the remaining 11% is adult. It is not clear how the observers determined the age stage, perhaps by size. A limitation is that only ‘adult’ and ‘juvenile’ were categorized; the predictions might be more valid if other stages such as ‘baby’ or ‘old’ were provided.

Some interesting plots:

Main model:

Comparisons:

Discussion

Findings:

Conclusion